Manipulating, processing, cleaning, and crunching data in Python.
Essential Python Libraries
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
General tasks
In [2]:
data_path = 'data/usagov_bitly_data2012-03-16-1331923249.txt'
open(data_path).readline()
Out[2]:
In [3]:
import json
with open(data_path, 'r') as data_file:
records = [json.loads(line) for line in data_file]
In [4]:
len(records)
Out[4]:
In [5]:
type(records)
Out[5]:
In [6]:
records[0]
Out[6]:
In [7]:
type(records[0])
Out[7]:
In [8]:
records[0].keys()
Out[8]:
In [9]:
time_zones = [record['tz'] for record in records if 'tz' in record]
In [10]:
len(time_zones)
Out[10]:
In [11]:
time_zones[:10]
Out[11]:
In [12]:
from collections import Counter
time_zone_counter = Counter(time_zones)
time_zone_counter
Out[12]:
In [13]:
sorted_time_zone_counter = time_zone_counter.most_common()
In [14]:
sorted_time_zone_counter
Out[14]:
In [15]:
df = pd.DataFrame(records)
In [16]:
df.head()
Out[16]:
In [17]:
df.tz.value_counts().head()
Out[17]:
In [18]:
clean_tz = df.tz.fillna('Missing')
clean_tz[clean_tz == ''] = 'Unknown'
In [19]:
tz_counts = clean_tz.value_counts()
tz_counts.head()
Out[19]:
In [20]:
tz_counts.head(10).plot(kind='barh')
Out[20]:
In [21]:
df.shape
Out[21]:
In [22]:
clean_df = df[df.a.notnull()]
clean_df.shape
Out[22]:
In [23]:
operating_system = np.where(clean_df.a.str.contains('Windows'), 'Windows', 'Not Windows')
operating_system
Out[23]:
In [24]:
operating_system.shape
Out[24]:
In [25]:
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('data/ml-1m/users.dat', delimiter='::', header=None, names=unames, engine='python')
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('data/ml-1m/ratings.dat', delimiter='::', header=None, names=rnames, engine='python')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('data/ml-1m/movies.dat', delimiter='::', header=None, names=mnames, engine='python')
In [26]:
users.head()
Out[26]:
In [27]:
ratings.head()
Out[27]:
In [28]:
movies.head()
Out[28]:
In [29]:
data = pd.merge(pd.merge(ratings, users), movies)
data.head()
Out[29]:
In [30]:
data.dtypes
Out[30]:
In [31]:
data.loc[0, :]
Out[31]:
In [32]:
data.columns
Out[32]:
In [33]:
data.loc[0, 'user_id']
Out[33]:
In [34]:
mean_ratings = data.pivot_table('rating', index='title', columns='gender')
mean_ratings.head()
Out[34]:
In [35]:
ratings_by_title = data.groupby('title').size().sort_values(ascending=False)
ratings_by_title.head()
Out[35]:
In [36]:
active_titles = ratings_by_title.index[ratings_by_title >= 250]
active_titles.size
Out[36]:
In [37]:
mean_ratings = mean_ratings.loc[active_titles, :]
mean_ratings.head()
Out[37]:
In [38]:
top_female_ratings = mean_ratings.sort_values(by='F', ascending=False)
top_female_ratings.head()
Out[38]:
In [39]:
mean_ratings['diff'] = mean_ratings['M'] - mean_ratings['F']
mean_ratings.head()
Out[39]:
In [40]:
mean_ratings.sort_values(by='diff', ascending=False).head()
Out[40]:
In [41]:
rating_std_by_title = data.groupby('title').rating.std()
rating_std_by_title.head()
Out[41]:
In [42]:
rating_std_by_title.loc[active_titles].sort_values(ascending=False).head()
Out[42]:
In [43]:
data_path = 'data/names/yob2015.txt'
names2015 = pd.read_csv(data_path, names=['name', 'sex', 'births'])
names2015.head()
Out[43]:
In [44]:
names2015.groupby('sex').births.sum()
Out[44]:
In [45]:
pieces = []
years = range(1880, 2016)
col_names = ['name', 'sex', 'births']
for year in years:
    path = 'data/names/yob%d.txt' % year
    frame = pd.read_csv(path, names=col_names)
    frame['year'] = year
    pieces.append(frame)
names = pd.concat(pieces, ignore_index=True)
In [46]:
names.shape
Out[46]:
In [47]:
names.head()
Out[47]:
In [48]:
total_births = names.pivot_table('births', index='year', columns='sex', aggfunc=sum)
total_births.plot()
Out[48]:
Key features
$ ipython qtconsole --pylab=inline
$ ipython --pylab
The previous two outputs are stored in the _ (one underscore) and __ (two underscores) variables, respectively. Input variables are stored in variables named _iX, where X is the input line number.
IPython is capable of logging the entire console session including input and output. Logging is turned on by typing %logstart.
Companion functions %logoff, %logon, %logstate, and %logstop.
You can perform most standard command-line actions as you would in the Windows or UNIX (Linux, OS X) shell without having to exit IPython. This includes executing shell commands, changing directories, and storing the results of a command in a Python object (list or string).
Starting a line in IPython with an exclamation point !, or bang, tells IPython to execute everything after the bang in the system shell.
In [49]:
!ls
The %alias magic function can define custom shortcuts for shell commands.
In [50]:
%alias ll ls -l
In [51]:
ll /usr
IPython has a simple directory bookmarking system to enable you to save aliases for common directories so that you can jump around very easily.
In [52]:
%bookmark notebooks /usr/local/src/notebooks
In [53]:
cd notebooks
IPython has closely integrated and enhanced the built-in Python pdb debugger. IPython has easy-to-use code timing and profiling tools.
The %debug command, when entered immediately after an exception, invokes the “post-mortem” debugger and drops you into the stack frame where the exception was raised.
Executing the %pdb command makes it so that IPython automatically invokes the debugger after any exception, a mode that many users will find especially useful.
%time runs a statement once, reporting the total execution time.
In [54]:
strings = ['foo', 'foobar', 'baz', 'qux', 'python', 'Guido Van Rossum'] * 100000
In [55]:
%time method1 = [x for x in strings if x.startswith('foo')]
In [56]:
%time method2 = [x for x in strings if x[:3] == 'foo']
To get a more precise measurement, use the %timeit magic function. Given an arbitrary statement, it has a heuristic to run a statement multiple times to produce a fairly accurate average runtime.
In [57]:
%timeit [x for x in strings if x.startswith('foo')]
In [58]:
%timeit [x for x in strings if x[:3] == 'foo']
Profiling code is closely related to timing code, except it is concerned with determining where time is spent. The main Python profiling tool is the cProfile module, which is not specific to IPython at all. cProfile executes a program or any arbitrary block of code while keeping track of how much time is spent in each function.
In [59]:
# python -m cProfile -s cumulative cprof_example.py
%prun takes the same “command line options” as cProfile but will profile an arbitrary Python statement instead of a whole .py file
%lprun computes a line-by-line profiling of one or more functions.
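A minimal sketch of %lprun, assuming the line_profiler extension is installed and loaded (the function profiled here is made up for illustration):
%load_ext line_profiler

def add_and_sum(x, y):
    added = x + y          # elementwise addition
    return added.sum()     # reduce to a scalar

# profile add_and_sum line by line while running the given statement
%lprun -f add_and_sum add_and_sum(np.random.randn(100), np.random.randn(100))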
The IPython Notebook has a JSON-based .ipynb document format that enables easy sharing of code, output, and figures. The notebook application runs as a lightweight server process on the command line.
IPython's configuration options are specified in a special ipython_config.py file, which is found in the ~/.config/ipython/ directory on UNIX-like systems and in the %HOME%/.ipython/ directory on Windows.
NumPy, short for Numerical Python, is the fundamental package required for high performance scientific computing and data analysis.
Features:
One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large data sets in Python. An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array
In [60]:
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1
Out[60]:
In [61]:
arr1.shape
Out[61]:
In [62]:
arr1.dtype
Out[62]:
In [63]:
np.arange(15)
Out[63]:
In [64]:
numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.string_)
numeric_strings.astype(float)
Out[64]:
Arrays are important because they enable you to express batch operations on data without writing any for loops. This is usually called vectorization.
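For example, arithmetic between equal-size arrays, or between an array and a scalar, is applied elementwise with no explicit loop (a small illustrative sketch):
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr * arr        # elementwise multiplication
arr - arr        # elementwise subtraction
1 / arr          # a scalar operation propagates to every element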
In [65]:
arr = np.zeros((2, 2, 2))
arr[:1, :1, :1]
Out[65]:
In [66]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
data = np.random.randn(7, 4)
In [67]:
data[names == "Bob", :]
Out[67]:
Note: The Python keywords and and or do not work with boolean arrays.
To select out a subset of the rows in a particular order, you can simply pass a list or ndarray of integers specifying the desired order
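A quick sketch of both points, combining conditions with & and | (not the and/or keywords) and reordering rows with an integer list (fancy indexing):
arr = np.arange(32).reshape((8, 4))
mask = (np.arange(8) % 2 == 0) | (np.arange(8) > 5)   # use & and |, not 'and'/'or'
arr[mask]
arr[[4, 3, 0, 6]]        # rows in the requested order
arr[[-3, -5, -7]]        # negative indices select rows from the end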
In [68]:
arr = np.arange(15).reshape((3, 5))
arr
Out[68]:
In [69]:
arr.T
Out[69]:
In [70]:
arr.transpose((1, 0))
Out[70]:
A universal function, or ufunc, is a function that performs elementwise operations on data in ndarrays. You can think of them as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results.
Many ufuncs are simple elementwise transformations, like sqrt or exp
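For example, unary and binary ufuncs:
arr = np.arange(10)
np.sqrt(arr)                 # unary ufunc: elementwise square root
np.exp(arr)                  # unary ufunc: elementwise exponential
x = np.random.randn(8)
y = np.random.randn(8)
np.maximum(x, y)             # binary ufunc: elementwise maximum of two arrays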
Using NumPy arrays enables you to express many kinds of data processing tasks as concise array expressions that might otherwise require writing loops. This practice of replacing explicit loops with array expressions is commonly referred to as vectorization.
The numpy.where function is a vectorized version of the ternary expression x if condition else y
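For example:
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])
np.where(cond, xarr, yarr)          # take from xarr where cond is True, else from yarr
arr = np.random.randn(4, 4)
np.where(arr > 0, 2, -2)            # the second and third arguments can also be scalars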
A set of mathematical functions that compute statistics about an entire array, or about the data along an axis, is accessible as array methods. Aggregations (often called reductions) like sum, mean, and standard deviation std can be used either by calling the array instance method or by using the top-level NumPy function.
In [71]:
arr = np.random.randn(100)
(arr > 0).sum()
Out[71]:
In [72]:
bools = np.array([False, False, True, False])
In [73]:
bools.any()
Out[73]:
In [74]:
bools.all()
Out[74]:
Like Python’s built-in list type, NumPy arrays can be sorted in-place using the sort method. Multidimensional arrays can have each 1D section of values sorted in-place along an axis by passing the axis number to sort. The top level method np.sort returns a sorted copy of an array instead of modifying the array in place.
np.unique returns the sorted unique values in an array.
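A few quick sketches of sorting and np.unique:
arr = np.random.randn(8)
arr.sort()                   # in-place sort
np.sort(arr)                 # returns a sorted copy instead
arr2d = np.random.randn(5, 3)
arr2d.sort(axis=1)           # sort each row in place along axis 1
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will'])
np.unique(names)             # sorted unique values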
np.save and np.load are the two workhorse functions for efficiently saving and loading array data on disk. Arrays are saved by default in an uncompressed raw binary format with file extension .npy.
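For example (the file names here are purely illustrative):
arr = np.arange(10)
np.save('some_array', arr)                      # writes some_array.npy
np.load('some_array.npy')
np.savez('array_archive.npz', a=arr, b=arr)     # several arrays in one zipped archive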
np.loadtxt and np.genfromtxt are two functions to load data into vanilla NumPy arrays.
numpy.linalg has a standard set of matrix decompositions and things like inverse and determinant. These are implemented under the hood using the same industry-standard Fortran libraries used in other languages like MATLAB and R, such as BLAS, LAPACK, or possibly (depending on your NumPy build) the Intel MKL.
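For example:
from numpy.linalg import inv, det
X = np.random.randn(3, 3)
mat = X.T.dot(X)             # a symmetric matrix built from random data
inv(mat)                     # matrix inverse
det(mat)                     # determinant
mat.dot(inv(mat))            # approximately the identity matrix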
The numpy.random module supplements the built-in Python random with functions for efficiently generating whole arrays of sample values from many kinds of probability distributions.
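For example:
samples = np.random.normal(loc=0, scale=1, size=(4, 4))   # 4 x 4 array of standard normal draws
np.random.randint(0, 10, size=8)                           # uniform random integers in [0, 10)
np.random.binomial(10, 0.5, size=5)                        # binomial(n=10, p=0.5) draws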
In [1]:
import pandas as pd
from pandas import Series, DataFrame
A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.
In [2]:
obj = Series([4, 7, -5, 3])
obj
Out[2]:
A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series (one for all sharing the same index)
In [3]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
frame
Out[3]:
pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names). Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index. Index objects are immutable and thus can’t be modified by the user
In [4]:
obj = Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index
Out[4]:
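A small illustration of that immutability (sketch):
index = pd.Index(['a', 'b', 'c'])
# index[1] = 'd'   # would raise TypeError: Index does not support mutable operations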
The reindex method creates a new object with the data conformed to a new index.
In [79]:
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj
Out[79]:
In [80]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2
Out[80]:
Dropping one or more entries from an axis is easy if you have an index array or list without those entries.
In [81]:
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj
Out[81]:
In [82]:
obj.drop('c', axis=0)
Out[82]:
Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers.
In [83]:
obj = Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj
Out[83]:
In [84]:
obj['b']
Out[84]:
In [85]:
obj.loc['b']
Out[85]:
In [86]:
obj.ix['b']  # .ix is deprecated in newer pandas; prefer .loc (labels) or .iloc (positions)
Out[86]:
In [87]:
obj.iloc[1]
Out[87]:
One of the most important pandas features is the behavior of arithmetic between objects with different indexes. When adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs.
In [5]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
In [6]:
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
In [7]:
s1 + s2
Out[7]:
In [8]:
s1.add(s2, fill_value=0)
Out[8]:
In [10]:
import numpy as np
frame = DataFrame(np.random.randn(4, 3), columns=list("bde"), index=["Utah", "Ohio", "Texas", "Oregon"])
In [11]:
np.abs(frame)
Out[11]:
In [13]:
f = lambda x: x.max() - x.min()
In [14]:
frame.apply(f)
Out[14]:
In [16]:
frame.apply(f, axis=1)
Out[16]:
In [17]:
obj = Series(range(4), index=['d', 'a', 'b', 'c'])
In [18]:
obj
Out[18]:
In [19]:
obj.sort_index()
Out[19]:
In [20]:
frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
In [23]:
frame.sort_values('b')
Out[23]:
Up until now all of the examples I’ve shown you have had unique axis labels (index values). While many pandas functions (like reindex) require that the labels be unique, it’s not mandatory.
In [24]:
obj = Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
In [25]:
obj['a']
Out[25]:
In [35]:
df = DataFrame(np.random.randn(4, 3), index=list("abcd"), columns=list("efg"))
In [36]:
df.sum()
Out[36]:
In [37]:
df.sum(axis=1)
Out[37]:
In [38]:
df.mean()
Out[38]:
In [39]:
df.describe()
Out[39]:
In [40]:
df = DataFrame(np.random.randn(4, 3), index=list("0123"), columns=list("abc"))
In [41]:
df
Out[41]:
In [42]:
df.corr()
Out[42]:
In [44]:
df.cov()
Out[44]:
In [45]:
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
In [46]:
obj.unique()
Out[46]:
In [47]:
obj.value_counts()
Out[47]:
In [48]:
obj.isin(['b', 'c'])
Out[48]:
pandas uses the floating point value NaN (Not a Number) to represent missing data in both floating point and non-floating point arrays. It is just used as a sentinel that can be easily detected.
In [49]:
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
In [50]:
string_data.isnull()
Out[50]:
In [53]:
from numpy import nan as NA
data = Series([1, NA, 3.5, NA, 7])
data.dropna()
Out[53]:
Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the “holes” in any number of ways. For most purposes, the fillna method is the workhorse function to use. Calling fillna with a constant replaces missing values with that value.
In [57]:
data.fillna(0)
Out[57]:
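fillna accepts more than a single constant; a couple of common variations on the same Series (sketch):
data.fillna(data.mean())          # fill holes with the mean of the observed values
data.fillna(method='ffill')       # propagate the last valid observation forward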
Hierarchical indexing is an important feature of pandas enabling you to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to work with higher dimensional data in a lower dimensional form.
In [58]:
data = Series(np.random.randn(10), index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'], [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
In [59]:
data
Out[59]:
In [60]:
data.index
Out[60]:
In [61]:
frame = DataFrame(np.arange(12).reshape((4, 3)),
index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
columns=[['Ohio', 'Ohio', 'Colorado'],
['Green', 'Red', 'Green']])
In [66]:
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
In [67]:
frame
Out[67]:
In [68]:
frame.swaplevel("key1", "key2")
Out[68]:
In [69]:
frame.swaplevel(0, 1).sort_index(level=0)  # sortlevel(0) in older pandas versions
Out[69]:
In [70]:
frame.sum(level='key2')  # equivalent to frame.groupby(level='key2').sum()
Out[70]:
In [71]:
frame = DataFrame({'a': range(7), 'b': range(7, 0, -1),
'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
'd': [0, 1, 2, 0, 1, 2, 3]})
In [73]:
frame
Out[73]:
In [72]:
frame.set_index(['c', 'd'])
Out[72]:
Input and output typically falls into a few main categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network sources like web APIs.
pandas features a number of functions for reading tabular data as a DataFrame object. read_csv and read_table are likely the ones you’ll use the most. The options for these functions fall into a few categories: indexing, type inference and data conversion, datetime parsing, iterating, and handling of unclean data.
Type inference is one of the more important features of these functions; that means you don’t have to specify which columns are numeric, integer, boolean, or string.
When processing very large files or figuring out the right set of arguments to correctly process a large file, you may only want to read in a small piece of a file or iterate through smaller chunks of the file.
If you want to only read out a small number of rows (avoiding reading the entire file), specify that with nrows.
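For example (big_file.csv is a hypothetical path used for illustration):
pd.read_csv('big_file.csv', nrows=5)                    # read only the first 5 rows
chunker = pd.read_csv('big_file.csv', chunksize=1000)   # iterator over 1000-row pieces
total = 0
for piece in chunker:
    total += len(piece)                                  # e.g. count rows without loading the whole file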
Using DataFrame’s to_csv method, we can write the data out to a comma-separated file. Series also has a to_csv method.
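For instance, with a throwaway DataFrame and hypothetical output paths:
tiny = pd.DataFrame({'a': [1, 2, 3], 'b': [4.5, np.nan, 6.0]})
tiny.to_csv('out.csv')                      # write to a comma-separated file
tiny.to_csv('out_pipe.txt', sep='|')        # any single-character delimiter works
tiny.to_csv('out_na.csv', na_rep='NULL')    # represent missing values with a sentinel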
Most forms of tabular data can be loaded from disk using functions like pandas.read_table. In some cases, however, some manual processing may be necessary. It’s not uncommon to receive a file with one or more malformed lines that trip up read_table. For any file with a single-character delimiter, you can use Python’s built-in csv module. To use it, pass any open file or file-like object to csv.reader.
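A minimal csv.reader sketch (the file name is hypothetical):
import csv

with open('examples.csv') as f:
    reader = csv.reader(f)
    lines = list(reader)
header, values = lines[0], lines[1:]
data_dict = {h: v for h, v in zip(header, zip(*values))}   # columns keyed by header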
JSON (short for JavaScript Object Notation) has become one of the standard formats for sending data by HTTP request between web browsers and other applications. It is a much more flexible data format than a tabular text form like CSV.
In [76]:
obj = """{"name": "Wes",
"places_lived": ["United States", "Spain", "Germany"],
"pet": null,
"siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"},
{"name": "Katie", "age": 33, "pet": "Cisco"}]
}"""
In [77]:
obj
Out[77]:
JSON is very nearly valid Python code with the exception of its null value null and some other nuances (such as disallowing trailing commas at the end of lists). The basic types are objects (dicts), arrays (lists), strings, numbers, booleans, and nulls. All of the keys in an object must be strings. There are several Python libraries for reading and writing JSON data. I’ll use json here as it is built into the Python standard library. To convert a JSON string to Python form, use json.loads.
json.dumps on the other hand converts a Python object back to JSON
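For example, with the obj string above:
import json

result = json.loads(obj)                    # JSON string -> Python dict
asjson = json.dumps(result)                 # Python object -> JSON string
siblings = pd.DataFrame(result['siblings'], columns=['name', 'age'])   # list of dicts into a DataFrame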
Python has many libraries for reading and writing data in the ubiquitous HTML and XML formats. lxml (http://lxml.de) is one that has consistently strong performance in parsing very large files. lxml has multiple programmer interfaces; first I’ll show using lxml.html for HTML, then parse some XML using lxml.objectify.
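A minimal sketch with lxml.html, assuming lxml is installed (the HTML string is made up for illustration):
from lxml import html

doc = html.fromstring('<html><body><a href="http://example.com">Example</a></body></html>')
links = doc.findall('.//a')                 # all <a> elements in the document
urls = [lnk.get('href') for lnk in links]   # their href attributes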
One of the easiest ways to store data efficiently in binary format is using Python’s built-in pickle serialization. Conveniently, pandas objects all have a to_pickle method which writes the data to disk as a pickle.
You read the data back into Python with pandas.read_pickle, another pickle convenience function.
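For example, reusing the babynames file from earlier (the pickle path is hypothetical):
frame = pd.read_csv('data/names/yob2015.txt', names=['name', 'sex', 'births'])
frame.to_pickle('frame_pickle')             # write the object to disk as a pickle
pd.read_pickle('frame_pickle')              # read it back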
There are a number of tools that facilitate efficiently reading and writing large amounts of scientific data in binary format on disk. A popular industry-grade library for this is HDF5, which is a C library with interfaces in many other languages like Java, Python, and MATLAB. The “HDF” in HDF5 stands for hierarchical data format. Each HDF5 file contains an internal file system-like node structure enabling you to store multiple datasets and supporting metadata. Compared with simpler formats, HDF5 supports on-the-fly compression with a variety of compressors, enabling data with repeated patterns to be stored more efficiently. For very large datasets that don’t fit into memory, HDF5 is a good choice as you can efficiently read and write small sections of much larger arrays.
There are not one but two interfaces to the HDF5 library in Python, PyTables and h5py, each of which takes a different approach to the problem. h5py provides a direct, but high-level interface to the HDF5 API, while PyTables abstracts many of the details of HDF5 to provide multiple flexible data containers, table indexing, querying capability, and some support for out-of-core computations.
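A short HDFStore sketch, assuming PyTables is installed (the file name is hypothetical):
frame = pd.DataFrame({'a': np.random.randn(100)})
store = pd.HDFStore('mydata.h5')
store['obj1'] = frame                       # store a DataFrame under a key, dict-style
store['obj1_col'] = frame['a']              # Series work too
store['obj1']                               # retrieve it back
store.close()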
pandas also supports reading tabular data stored in Excel 2003 (and higher) files using the ExcelFile class. Internally ExcelFile uses the xlrd and openpyxl packages, so you may have to install them first. To use ExcelFile, create an instance by passing a path to an xls or xlsx file.
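For example, with a hypothetical workbook data.xlsx:
xls_file = pd.ExcelFile('data.xlsx')        # requires xlrd/openpyxl
table = xls_file.parse('Sheet1')            # read one sheet into a DataFrame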
Many websites have public APIs providing data feeds via JSON or some other format. There are a number of ways to access these APIs from Python; one easy-to-use method that I recommend is the requests package (http://docs.python-requests.org).
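A brief sketch with requests; the GitHub issues endpoint is just one illustrative public JSON API:
import requests

url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)                    # HTTP GET request
data = resp.json()                          # decode the JSON response body into Python objects
issues = pd.DataFrame(data, columns=['number', 'title', 'state'])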
In many applications data rarely comes from text files, that being a fairly inefficient way to store large amounts of data. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are in wide use, and many alternative non-SQL (so-called NoSQL) databases have become quite popular. The choice of database is usually dependent on the performance, data integrity, and scalability needs of an application.
Loading data from SQL into a DataFrame is fairly straightforward, and pandas has some functions to simplify the process. As an example, I’ll use an in-memory SQLite database using Python’s built-in sqlite3 driver.
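A self-contained sketch with an in-memory SQLite database:
import sqlite3

con = sqlite3.connect(':memory:')           # in-memory SQLite database
con.execute('CREATE TABLE test (a VARCHAR(20), b REAL)')
con.executemany('INSERT INTO test VALUES (?, ?)', [('Atlanta', 1.25), ('Tallahassee', 2.6)])
con.commit()
pd.read_sql('SELECT * FROM test', con)      # query results straight into a DataFrame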
pymongo is the official driver for MongoDB. Documents stored in MongoDB are found in collections inside databases. Each running instance of the MongoDB server can have multiple databases, and each database can have multiple collections.
Much of the programming work in data analysis and modeling is spent on data preparation: loading, cleaning, transforming, and rearranging. Sometimes the way that data is stored in files or databases is not the way you need it for a data processing application. Many people choose to do ad hoc processing of data from one form to another using a general-purpose programming language, like Python, Perl, R, or Java, or UNIX text processing tools like sed or awk.
Data contained in pandas objects can be combined together in a number of built-in ways:
Merge or join operations combine data sets by linking rows using one or more keys. These operations are central to relational databases. The merge function in pandas is the main entry point for using these algorithms on your data.
In [79]:
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})
df2 = DataFrame({'key': ['a', 'b', 'd'],
'data2': range(3)})
In [80]:
df1.merge(df2)
Out[80]:
In [81]:
pd.merge(df1, df2, on="key")
Out[81]:
By default merge does an 'inner' join; the keys in the result are the intersection. Other possible options are 'left' , 'right' , and 'outer'.
In some cases, the merge key or keys in a DataFrame will be found in its index. In this case, you can pass left_index=True or right_index=True (or both) to indicate that the index should be used as the merge key.
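For instance, continuing with df1 and df2 above, plus a second pair merged on the right frame's index (a brief sketch):
pd.merge(df1, df2, how='outer')             # union of the keys from both frames
left = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})
right = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
pd.merge(left, right, left_on='key', right_index=True)   # right's index is the merge key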
Another kind of data combination operation is alternatively referred to as concatenation, binding, or stacking. NumPy has a concatenate function for doing this with raw NumPy arrays.
In [82]:
arr = np.arange(12).reshape((3, 4))
In [86]:
np.concatenate([arr, arr], axis=1)
Out[86]:
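pandas' concat does the analogous thing for Series and DataFrames, with options for the axis and for labeling the pieces (a brief sketch):
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
pd.concat([s1, s2])                         # stack values and indexes end to end
pd.concat([s1, s2], axis=1)                 # glue along the columns instead (outer join on the index)
pd.concat([s1, s2], keys=['one', 'two'])    # hierarchical index identifying each piece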
Another data combination situation can’t be expressed as either a merge or concatenation operation. You may have two datasets whose indexes overlap in full or part.
In [87]:
a = Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
index=['f', 'e', 'd', 'c', 'b', 'a'])
b = Series(np.arange(len(a), dtype=np.float64),
index=['f', 'e', 'd', 'c', 'b', 'a'])
In [88]:
b[-1] = np.nan
In [89]:
np.where(pd.isnull(a), b, a)
Out[89]:
In [90]:
b[:-2].combine_first(a[2:])
Out[90]:
There are a number of fundamental operations for rearranging tabular data. These are alternately referred to as reshape or pivot operations.
Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions: stack, which rotates (pivots) the columns in the data into the rows, and unstack, which pivots the rows back into the columns.
In [96]:
data = DataFrame(np.arange(6).reshape((2, 3)),
index=pd.Index(['Ohio', 'Colorado'], name='state'),
columns=pd.Index(['one', 'two', 'three'], name='number'))
In [97]:
data
Out[97]:
In [98]:
data.stack()
Out[98]:
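unstack is the inverse operation, pivoting an index level back into the columns (a brief sketch):
result = data.stack()
result.unstack()                            # back to the original wide layout
result.unstack(0)                           # unstack a different level by number (here 'state')
result.unstack('state')                     # levels can also be referred to by name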
Note that pivot is just a shortcut for creating a hierarchical index using set_index and reshaping with unstack.
In [99]:
data = DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
'k2': [1, 1, 2, 3, 3, 4, 4]})
In [100]:
data
Out[100]:
In [102]:
data.duplicated()
Out[102]:
In [103]:
data.drop_duplicates()
Out[103]:
In [3]:
import numpy as np
import pandas as pd
from pandas import *
data = DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
'corned beef', 'Bacon', 'pastrami', 'honey ham',
'nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
In [4]:
data
Out[4]:
In [5]:
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
In [6]:
data["animal"] = data["food"].map(str.lower).map(meat_to_animal)
In [7]:
data
Out[7]:
In [10]:
data.columns.map(str.upper)
Out[10]:
In [12]:
data.rename(columns=str.upper)
Out[12]:
Continuous data is often discretized or otherwise separated into “bins” for analysis.
In [13]:
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
In [14]:
bins = [18, 25, 35, 60, 100]
In [15]:
cats = pd.cut(ages, bins)
In [16]:
cats
Out[16]:
In [18]:
cats.codes
Out[18]:
In [19]:
cats.value_counts()
Out[19]:
In [20]:
np.random.seed(12345)
data = DataFrame(np.random.randn(1000, 4))
In [21]:
data.describe()
Out[21]:
In [22]:
data[(np.abs(data) > 3).any(1)]
Out[22]:
In [25]:
df = DataFrame(np.arange(5 * 4).reshape(5, 4))
In [28]:
sampler = np.random.permutation(5)
df.take(sampler)
Out[28]:
Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a “dummy” or “indicator” matrix. If a column in a DataFrame has k distinct values, you would derive a matrix or DataFrame with k columns of 1s and 0s.
In [29]:
df = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
'data1': range(6)})
In [30]:
df
Out[30]:
In [31]:
pd.get_dummies(df['key'])
Out[31]:
Python has long been a popular data munging language in part due to its ease-of-use for string and text processing. Most text operations are made simple with the string object’s built-in methods. For more complex pattern matching and text manipulations, regular expressions may be needed. pandas adds to the mix by enabling you to apply string and regular expressions concisely on whole arrays of data, additionally handling the annoyance of missing data.
In [37]:
s = 'a,b, guido'
In [38]:
s.split(',')
Out[38]:
In [39]:
[x.strip() for x in s.split(',')]
Out[39]:
In [42]:
','.join(['a', 'b', 'c'])
Out[42]:
In [44]:
import re
text = "foo bar\t baz \tqux"
In [46]:
re.split(r'\s+', text)
Out[46]:
In [47]:
regex = re.compile(r'\s+')
In [49]:
regex.split(text)
Out[49]:
In [50]:
regex.findall(text)
Out[50]:
Creating a regex object with re.compile is highly recommended if you intend to apply the same expression to many strings; doing so will save CPU cycles.
In [52]:
data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
'Rob': 'rob@gmail.com', 'Wes': np.nan}
data = Series(data)
In [55]:
data.str.contains("gmail")
Out[55]:
In [57]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
In [59]:
data.str.findall(pattern, flags=re.IGNORECASE)
Out[59]:
Visualization may be a part of the exploratory process; for example, it can help identify outliers, suggest needed data transformations, or generate ideas for models. matplotlib is a (primarily 2D) desktop plotting package designed for creating publication-quality plots. matplotlib has a number of add-on toolkits, such as mplot3d for 3D plots and basemap for mapping and projections.
In [60]:
import matplotlib.pyplot as plt
%matplotlib inline
In [61]:
plt.plot(np.arange(10))
Out[61]:
Plots in matplotlib reside within a Figure object. You can create a new figure with plt.figure
In [79]:
fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 3)
plt.plot(np.random.randn(50).cumsum(), 'k--')
_ = ax1.hist(np.random.randn(100), bins=20, color='k', alpha=0.3)
ax2.scatter(np.arange(30), np.arange(30) + 3 * np.random.randn(30))
Out[79]:
The spacing can be most easily changed using the subplots_adjust Figure method. wspace and hspace control the percentage of the figure width and figure height, respectively, to use as spacing between subplots.
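For example, a sketch that removes the inter-subplot spacing entirely:
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
for i in range(2):
    for j in range(2):
        axes[i, j].hist(np.random.randn(500), bins=50, color='k', alpha=0.5)
plt.subplots_adjust(wspace=0, hspace=0)     # no gap between the subplots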
The pyplot interface, designed for interactive use, consists of methods like xlim, xticks, and xticklabels. These control the plot range, tick locations, and tick labels, respectively. All such methods act on the active or most recently created AxesSubplot. Each of them corresponds to two methods on the subplot object itself; in the case of xlim these are ax.get_xlim and ax.set_xlim.
In [89]:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(np.random.randn(1000).cumsum())
ticks = ax.set_xticks([0, 250, 500, 750, 1000])
ax.set_title('My first matplotlib plot')
ax.set_xlabel('Stages')
Out[89]:
In [92]:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(np.random.randn(1000).cumsum(), 'k', label='one')
ax.plot(np.random.randn(1000).cumsum(), 'k--', label='two')
ax.plot(np.random.randn(1000).cumsum(), 'k.', label='three')
ax.legend(loc='best')
Out[92]:
Annotations and text can be added using the text, arrow, and annotate functions. text draws text at given coordinates (x, y) on the plot with optional custom styling.
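A small sketch of text and annotate on a subplot (coordinates and labels are made up):
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.plot(np.random.randn(100).cumsum(), 'k')
ax.text(10, 0, 'Hello world!', family='monospace', fontsize=10)     # text at data coordinates
ax.annotate('a point of interest', xy=(50, 0), xytext=(60, 2),
            arrowprops=dict(arrowstyle='->'))                        # arrow pointing from the text to xy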
Drawing shapes requires some more care. matplotlib has objects that represent many common shapes, referred to as patches. Some of these, like Rectangle and Circle, are found in matplotlib.pyplot, but the full set is located in matplotlib.patches.
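For example, a sketch adding a rectangle and a circle to a subplot:
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
rect = plt.Rectangle((0.2, 0.75), 0.4, 0.15, color='k', alpha=0.3)
circ = plt.Circle((0.7, 0.2), 0.15, color='b', alpha=0.3)
ax.add_patch(rect)                          # patches must be added to a subplot explicitly
ax.add_patch(circ)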
The active figure can be saved to file using plt.savefig. This method is equivalent to the figure object’s savefig instance method.
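For example (the file names are hypothetical):
plt.savefig('figpath.png', dpi=400, bbox_inches='tight')   # high resolution, trimmed whitespace
plt.savefig('figpath.svg')                                  # output format inferred from the extension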
matplotlib comes configured with color schemes and defaults that are geared primarily toward preparing figures for publication. Fortunately, nearly all of the default behavior can be customized via an extensive set of global parameters governing figure size, subplot spacing, colors, font sizes, grid styles, and so on. There are two main ways to interact with the matplotlib configuration system. The first is programmatically from Python using the rc method.
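A brief sketch of the rc method:
plt.rc('figure', figsize=(10, 10))          # set the default figure size globally
font_options = {'family': 'monospace', 'weight': 'bold', 'size': 8}
plt.rc('font', **font_options)              # options can also be passed as a dict of keyword arguments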
For more extensive customization and to see a list of all the options, matplotlib comes with a configuration file matplotlibrc in the matplotlib/mpl-data directory. If you customize this file and place it in your home directory titled .matplotlibrc , it will be loaded each time you use matplotlib.
matplotlib is actually a fairly low-level tool. Therefore, pandas has an increasing number of high-level plotting methods for creating standard visualizations that take advantage of how data is organized in DataFrame objects.
Series and DataFrame each have a plot method for making many different plot types. By default, they make line plots.
In [6]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
s = pd.Series(np.random.randn(10).cumsum(), index=np.arange(0, 100, 10))
In [7]:
s.plot()
Out[7]:
In [9]:
df = pd.DataFrame(np.random.randn(10, 4).cumsum(0),
columns=['A', 'B', 'C', 'D'],
index=np.arange(0, 100, 10))
In [11]:
df
Out[11]:
In [10]:
df.plot()
Out[10]:
Making bar plots instead of line plots is as simple as passing kind='bar' (for vertical bars) or kind='barh' (for horizontal bars). In this case, the Series or DataFrame index will be used as the X (bar) or Y (barh) ticks.
In [13]:
data = pd.Series(np.random.rand(16), index=list('abcdefghijklmnop')).plot(kind="bar")
In [14]:
data = pd.Series(np.random.rand(16), index=list('abcdefghijklmnop')).plot(kind="barh")
A histogram, with which you may be well-acquainted, is a kind of bar plot that gives a discretized display of value frequency. The data points are split into discrete, evenly spaced bins, and the number of data points in each bin is plotted.
A related plot type is a density plot, which is formed by computing an estimate of a continuous probability distribution that might have generated the observed data. A usual procedure is to approximate this distribution as a mixture of kernels, that is, simpler distributions like the normal (Gaussian) distribution.
Scatter plots are a useful way of examining the relationship between two one-dimensional data series. matplotlib has a scatter plotting method that is the workhorse of making these kinds of plots.
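A few quick sketches of these plot types with random data (the density plot assumes SciPy is installed):
values = pd.Series(np.random.randn(200))
values.hist(bins=50)                        # histogram of value frequencies
values.plot(kind='kde')                     # kernel density estimate
x = np.random.randn(100)
y = x + np.random.randn(100) * 0.5
plt.scatter(x, y)                           # scatter plot of two 1D arrays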
Ushahidi is a non-profit software company that enables crowdsourcing of information related to natural disasters and geopolitical events via text message.
GroupBy uses split-apply-combine:
In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of an object. Once this is done, a function is applied to each group, producing a new value. Finally, the results of all those function applications are combined into a result object.
In [3]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
'key2' : ['one', 'two', 'one', 'two', 'one'],
'data1' : np.random.randn(5),
'data2' : np.random.randn(5)})
In [4]:
df
Out[4]:
In [11]:
df['data1'].groupby(df['key1']).mean()
Out[11]:
In [12]:
df['data1'].groupby([df['key1'], df['key2']]).mean()
Out[12]:
In [16]:
for name, group in df.groupby('key1'):
    print(name)
    print(group)
In [17]:
dict(list(df.groupby('key1')))
Out[17]:
In [21]:
list(df.groupby('key1')['data1'])
Out[21]:
In [22]:
list(df['data1'].groupby(df['key1']))
Out[22]:
In [26]:
people = pd.DataFrame(np.random.randn(5, 5),
columns=['a', 'b', 'c', 'd', 'e'],
index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
In [27]:
mapping = {'a': 'red', 'b': 'red', 'c': 'blue',
'd': 'blue', 'e': 'red', 'f' : 'orange'}
In [29]:
people.groupby(mapping, axis=1).mean()
Out[29]:
In [31]:
map_series = pd.Series(mapping)
In [32]:
people.groupby(map_series, axis=1).count()
Out[32]:
In [33]:
people.groupby(len).sum()
Out[33]:
In [34]:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
[1, 3, 5, 1, 3]], names=['cty', 'tenor'])
In [36]:
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
In [37]:
hier_df
Out[37]:
In [41]:
hier_df.groupby(level='cty', axis=1).count()
Out[41]:
An aggregation is any data transformation that produces scalar values from arrays, for example mean, count, min, and sum.
In [50]:
df = pd.DataFrame(np.random.randn(1, 12).reshape(3, 4))
In [51]:
df
Out[51]:
In [63]:
df.groupby(0).agg('mean')
Out[63]:
By default the group keys become the index of the result; you can disable this behavior in most cases by passing as_index=False to groupby.
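A small sketch reusing the structure of the earlier key1/key2 DataFrame:
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df.groupby('key1', as_index=False)[['data1', 'data2']].mean()   # group keys stay as a column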